This guide demonstrates how to perform functional clustering analysis on one-dimensional time series data using the MultiConnector package. We use ovarian cancer cell growth data as a case study to illustrate the complete workflow from data import to biological interpretation. The analysis identifies distinct growth patterns and relates them to cellular progeny information, providing insights into cancer cell behavior and potential therapeutic targets.


Introduction

Background

Functional clustering is a powerful statistical method for analyzing time series data where the goal is to group curves based on their shape and temporal patterns rather than individual time points. This approach is particularly valuable in biological and medical research where understanding distinct temporal patterns can reveal important insights about:

  • Disease progression patterns
  • Treatment response profiles
  • Cellular growth dynamics
  • Biomarker trajectories

The MultiConnector Package

MultiConnector implements advanced functional clustering methods based on the Sugar & James model, which:

  • Projects curves onto lower-dimensional spaces using spline coefficients
  • Accounts for within-curve correlation and measurement error
  • Provides robust clustering solutions through multiple random initializations
  • Offers comprehensive visualization and validation tools

Case Study: Ovarian Cancer Cell Growth

In this guide, we analyze ovarian cancer cell line growth curves to:

  1. Identify distinct growth patterns in cancer cell populations
  2. Relate patterns to progeny information to understand cellular heterogeneity
  3. Demonstrate the complete analysis workflow for one-dimensional functional clustering

Getting Started

Installation and Setup

# Load required libraries
library(dplyr)           # Data manipulation
library(parallel)        # Parallel computing
library(MultiConnector)  # Main clustering package
library(ggplot2)         # Enhanced plotting
library(knitr)           # Table formatting
library(kableExtra)      # Enhanced table styling

# Set up parallel processing
n_cores <- detectCores()
workers <- max(1, n_cores - 1)  # Leave one core free

cat("System Information:\n")
#> System Information:
cat("- Available CPU cores:", n_cores, "\n")
#> - Available CPU cores: 8
cat("- Cores used for analysis:", workers, "\n")
#> - Cores used for analysis: 7

Data Requirements

For one-dimensional functional clustering, your data should include:

  • Time series data: with columns subjID, measureID, time, and value
  • Annotation data: with a subjID column plus feature columns (e.g., treatment groups, demographics)
  • Consistent identifiers: subjID values must match between the time series and annotation tables
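As a concrete illustration, a minimal pair of tables in the expected shape might look like this (column names follow the requirements above; the IDs and values are invented):

```r
# Minimal sketch of the expected long-format time-series table
ts_example <- data.frame(
  subjID    = c("S1", "S1", "S1", "S2", "S2"),  # one ID per curve
  measureID = "Ovarian",                         # measurement name (recycled)
  time      = c(0, 24, 48, 0, 24),               # time in any consistent unit
  value     = c(100, 180, 310, 95, 150)          # growth readout
)

# Matching annotation table: one row per subjID
ann_example <- data.frame(
  subjID  = c("S1", "S2"),
  Progeny = c("TypeA", "TypeB")
)
```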

Data Import and Preparation

Loading the Dataset

We begin by loading ovarian cancer cell growth data, which contains:

  • 475 cell growth curves measured over time
  • Progeny information indicating cellular lineage
  • Growth measurements at multiple time points

Creating the CONNECTORData Object

The ConnectorData() function validates and structures the data for analysis:

# Locate the example data shipped with the package
time_series_path <- system.file("Data/OvarianCancer/Ovarian_TimeSeries.xlsx",
                                package = "MultiConnector")
annotations_path <- system.file("Data/OvarianCancer/Ovarian_Annotations.txt",
                                package = "MultiConnector")

# Create the main data object
Data <- ConnectorData(time_series_path, annotations_path)
#> ############################### 
#> ######## Summary ##############
#> Number of time points: 21
#> Min curve length: 7
#> Max curve length: 18
#> ###############################

Initial Data Exploration

Understanding your data structure is crucial before clustering. We examine:

  1. Overall growth patterns
  2. Feature relationships
  3. Time point distributions

Basic Time Series Visualization

# Plot 1: Basic time series overview
p1 <- plot(Data) + 
  ggtitle("A) All Growth Curves") +
  theme_minimal()

# Plot 2: Colored by progeny feature
p2 <- plot(Data, feature = "Progeny") + 
  ggtitle("B) Curves by Progeny Type") +
  theme_minimal()

# Combine plots if possible (requires gridExtra or patchwork)
if (requireNamespace("gridExtra", quietly = TRUE)) {
  gridExtra::grid.arrange(p1, p2, ncol = 2)
} else {
  print(p1)
  print(p2)
}
Initial data exploration: (A) All growth curves overlaid, (B) Curves colored by progeny type.


Time Point Distribution Analysis

# Analyze time point distributions
plotTimes(Data, large = TRUE)
Time point distribution analysis showing data density across the measurement period.


Key Observations

From the initial exploration, we can observe:

  • Growth curve diversity: Multiple distinct patterns are visible
  • Feature associations: Some progeny types may show preferred growth patterns
  • Data completeness: Time point coverage affects analysis quality

Data Preprocessing

Truncation Analysis

Many time series datasets have sparse data at later time points. Truncation can improve clustering stability by focusing on well-sampled regions.

# Analyze truncation effects
truncatePlot(Data, measure = "Ovarian", truncTime = 70)
Truncation analysis helping to identify optimal cutoff time for maintaining data quality.


# Apply truncation based on analysis
DataTrunc <- truncate(Data, measure = "Ovarian", truncTime = 70)
#> ############################### 
#> ######## Summary ##############
#> Number of time points: 21
#> Min curve length: 7
#> Max curve length: 16
#> ###############################
# Visualize truncated data
plot(DataTrunc) + 
  ggtitle("Growth Curves After Truncation (t ≤ 70)") +
  theme_minimal()
Data after truncation at time = 70, showing improved data density.



Parameter Estimation

Spline Dimension Selection

The spline dimension (p) parameter controls curve flexibility:

  • Higher p: More flexible curves, risk of overfitting
  • Lower p: Smoother curves, may miss important features
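To build intuition for p, one can inspect B-spline bases of increasing dimension using base R's splines package (this is independent of MultiConnector; df here plays the role of p):

```r
library(splines)  # ships with base R

time_grid <- seq(0, 70, length.out = 50)  # grid similar to the truncated data

# Each column is one basis function; more columns = more flexible curves
basis3 <- bs(time_grid, df = 3)
basis6 <- bs(time_grid, df = 6)
dim(basis3)  # 50 x 3: smooth fits with few coefficients
dim(basis6)  # 50 x 6: wigglier fits, higher overfitting risk
```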
# Estimate optimal spline dimension
CrossLogLikePlot <- estimatepDimension(DataTrunc, p = 2:6, cores = workers)

# Display results
print(CrossLogLikePlot)
Cross-validation results for spline dimension selection showing optimal p value.


# Set the optimal value (typically the elbow of the cross-validated
# log-likelihood curve, beyond which larger p stops improving the fit)
optimal_p <- 3
cat("Selected optimal p =", optimal_p, "\n")
#> Selected optimal p = 3

Parameter Selection Guidelines

p value   Characteristics          Best for
2-3       Smooth, simple curves    Linear/quadratic patterns
4-5       Moderate flexibility     Complex but stable patterns
6+        High flexibility         Very complex curves (risk of overfitting)

Clustering Analysis

Comprehensive Clustering

We test multiple cluster numbers to find the optimal solution:

# Perform clustering analysis
# Note: This is computationally intensive
clusters <- estimateCluster(
  DataTrunc, 
  G = 2:6,              # Test 2-6 clusters
  p = optimal_p,        # Use optimal spline dimension
  runs = 20,            # Reduced for demonstration (use 100+ for final analysis)
  cores = workers       # Parallel processing
)
#> [1] "Total time: 11.44 secs"
# Display quality metrics
plot(clusters)
Clustering quality metrics across different numbers of clusters (G).


Quality Metrics Interpretation

  • fDB (functional Davies-Bouldin index): Lower values indicate more compact, well-separated clusters
  • Silhouette Score: Higher values (closer to 1) indicate better cluster quality
  • Stability: Consistent results across multiple runs indicate robust solutions
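The functional Davies-Bouldin index extends the classical Davies-Bouldin index from points to curves. As intuition only (this is not MultiConnector's functional implementation), the classical index can be computed on toy one-dimensional cluster summaries:

```r
# Classical Davies-Bouldin index on invented cluster summaries
centroids <- c(1, 4, 9)        # cluster "centers"
scatter   <- c(0.5, 0.6, 0.4)  # within-cluster dispersion

# For each cluster, take the worst (largest) similarity ratio to any
# other cluster, then average across clusters
db_index <- mean(sapply(seq_along(centroids), function(i) {
  max(sapply(seq_along(centroids)[-i], function(j) {
    (scatter[i] + scatter[j]) / abs(centroids[i] - centroids[j])
  }))
}))
db_index  # ~0.31 -- lower = more compact, better-separated clusters
```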

Cluster Selection

Based on quality metrics, we select the optimal configuration:

# Select optimal clustering (G=4 based on quality metrics)
ClusterData <- selectCluster(clusters, G = 4, "MinfDB")

Selected Configuration:

  • Number of clusters: 4
  • Selection criterion: Minimum fDB
  • This represents the most compact and well-separated clustering

Results Visualization and Interpretation

Basic Cluster Visualization

# Plot clusters
p1 <- plot(ClusterData) + 
  ggtitle("A) Clusters by Assignment") +
  theme_minimal()

# Plot by progeny feature
p2 <- plot(ClusterData, feature = "Progeny") + 
  ggtitle("B) Clusters by Progeny Type") +
  theme_minimal()

# Display plots
if (requireNamespace("gridExtra", quietly = TRUE)) {
  gridExtra::grid.arrange(p1, p2, ncol = 2)
} else {
  print(p1)
  print(p2)
}
Cluster visualization: (A) Growth curves colored by cluster assignment, (B) Curves colored by progeny type to examine biological associations.


Cluster-Feature Association Analysis

# Examine cluster-annotation relationships
annotations_summary <- getAnnotations(ClusterData)
print(annotations_summary)
#> [1] "IDSample"     "Progeny"      "Source"       "Real.Progeny"
# getAnnotations() returns the names of the available annotation columns
if (length(annotations_summary) > 0) {
  kable(annotations_summary,
        col.names = "Annotation column",
        caption = "Annotation columns available for cluster-feature analysis.") %>%
    kable_styling(bootstrap_options = c("striped", "hover"))
}
Annotation columns available for cluster-feature analysis.

Annotation column
IDSample
Progeny
Source
Real.Progeny

Cluster Validation

Quality Assessment

Comprehensive validation ensures clustering reliability:

# Perform validation analysis
Metrics <- validateCluster(ClusterData)

# Display validation plots
print(Metrics$plot)
Cluster validation metrics: (A) Silhouette analysis showing how well samples fit their clusters, (B) Entropy analysis measuring assignment uncertainty.


Validation Metrics Interpretation

  • Silhouette Analysis:
    • Values close to 1: Well-assigned samples
    • Values near 0: Borderline assignments
    • Negative values: Potentially misclassified samples
  • Entropy Analysis:
    • Low entropy: Confident cluster assignments
    • High entropy: Uncertain assignments requiring investigation
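Both metrics reduce to simple per-sample formulas. The sketch below illustrates them with invented distances and membership probabilities (not the package's internal computation):

```r
# Silhouette of one sample: a = mean distance to its own cluster,
# b = mean distance to the nearest other cluster
a <- 0.8
b <- 2.5
s <- (b - a) / max(a, b)
s  # 0.68: close to 1 => well assigned

# Entropy of one sample's posterior cluster membership probabilities
p <- c(0.90, 0.05, 0.03, 0.02)
entropy <- -sum(p * log(p))
entropy  # ~0.43, versus log(4) ~ 1.39 for a fully uncertain assignment
```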

Advanced Visualizations

Discriminant Analysis

Discriminant plots show clusters in reduced dimensional space:

# Generate discriminant plots
Discr <- DiscriminantPlot(ClusterData)
#> [1] "Percentage of variance explained:"
#> [1] 91.731111  5.665049  2.603840
#> [1] "Sum of first two components: 97.4"
# Display cluster-colored plot
print(Discr$ColCluster)

# Display feature-colored plot (if features exist)
if ("ColFeature" %in% names(Discr)) {
  print(Discr$ColFeature)
}

Spline-Based Cluster Representation

# Generate spline plots
splinePlots <- splinePlot(ClusterData)

# Display the main spline plot
if (length(splinePlots) > 0) {
  print(splinePlots[[1]])
}
Spline-based visualization showing the characteristic curve shape for each cluster.


Maximum Discrimination Analysis

# Identify most discriminative features
MaximumDiscriminationFunction(ClusterData)
#> $DiscrFunctionsPlot

#> 
#> $Separated


Biological Interpretation

Growth Pattern Analysis

Our clustering results suggest distinct growth patterns that can be characterized by inspecting the cluster mean curves (splinePlot()) alongside the cluster-annotation associations: clusters typically differ in growth rate, lag duration, and plateau level, and over-representation of a progeny type within a cluster points to lineage-associated growth behavior.


Conclusions and Next Steps

Summary of Findings

Our functional clustering analysis revealed:

  1. Four distinct growth patterns in ovarian cancer cell populations
  2. Associations between growth patterns and progeny types, where supported by the cluster-annotation analysis
  3. Robust clustering solution validated through multiple quality metrics

Methodological Insights

The MultiConnector package successfully:

  • Handled one-dimensional functional data
  • Provided stable clustering solutions
  • Offered comprehensive visualization tools
  • Enabled biological interpretation

Session Information

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-apple-darwin23.4.0
#> Running under: macOS 15.5
#> 
#> Matrix products: default
#> BLAS:   /usr/local/Cellar/openblas/0.3.28/lib/libopenblasp-r0.3.28.dylib 
#> LAPACK: /usr/local/Cellar/r/4.4.1/lib/R/lib/libRlapack.dylib;  LAPACK version 3.12.0
#> 
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: Europe/Rome
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] parallel  stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#> [1] kableExtra_1.4.0   knitr_1.50         ggplot2_3.5.2      dplyr_1.1.4       
#> [5] MultiConnector_1.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] gridExtra_2.3       remotes_2.5.0       readxl_1.4.5       
#>  [4] rlang_1.1.6         magrittr_2.0.3      compiler_4.4.1     
#>  [7] roxygen2_7.3.2      systemfonts_1.2.3   vctrs_0.6.5        
#> [10] stringr_1.5.1       profvis_0.4.0       pkgconfig_2.0.3    
#> [13] crayon_1.5.3        fastmap_1.2.0       MetBrewer_0.2.0    
#> [16] ellipsis_0.3.2      magic_1.6-1         labeling_0.4.3     
#> [19] promises_1.3.3      rmarkdown_2.29      sessioninfo_1.2.3  
#> [22] tzdb_0.5.0          purrr_1.1.0         bit_4.6.0          
#> [25] xfun_0.53           cachem_1.1.0        jsonlite_2.0.0     
#> [28] gghalves_0.1.4      later_1.4.4         R6_2.6.1           
#> [31] bslib_0.9.0         stringi_1.8.7       RColorBrewer_1.1-3 
#> [34] rlist_0.4.6.2       pkgload_1.4.0       jquerylib_0.1.4    
#> [37] cellranger_1.1.0    Rcpp_1.1.0          usethis_3.1.0      
#> [40] readr_2.1.5         httpuv_1.6.16       Matrix_1.7-3       
#> [43] splines_4.4.1       tidyselect_1.2.1    rstudioapi_0.17.1  
#> [46] dichromat_2.0-0.1   abind_1.4-8         yaml_2.3.10        
#> [49] codetools_0.2-20    miniUI_0.1.2        pkgbuild_1.4.8     
#> [52] lattice_0.22-7      tibble_3.3.0        shiny_1.11.1       
#> [55] withr_3.0.2         evaluate_1.0.4      desc_1.4.3         
#> [58] isoband_0.2.7       urlchecker_1.0.1    xml2_1.4.0         
#> [61] pillar_1.11.0       geometry_0.5.2      plotly_4.11.0      
#> [64] generics_0.1.4      vroom_1.6.5         rprojroot_2.1.1    
#> [67] hms_1.1.3           commonmark_2.0.0    scales_1.4.0       
#> [70] xtable_1.8-4        RhpcBLASctl_0.23-42 glue_1.8.0         
#> [73] lazyeval_0.2.2      tools_4.4.1         data.table_1.16.4  
#> [76] fs_1.6.6            grid_4.4.1          tidyr_1.3.1        
#> [79] crosstalk_1.2.2     devtools_2.4.5      patchwork_1.3.2    
#> [82] cli_3.6.5           textshaping_1.0.1   viridisLite_0.4.2  
#> [85] svglite_2.2.1       gtable_0.3.6        sass_0.4.10        
#> [88] digest_0.6.37       htmlwidgets_1.6.4   farver_2.1.2       
#> [91] memoise_2.0.1       htmltools_0.5.8.1   lifecycle_1.0.4    
#> [94] httr_1.4.7          statmod_1.5.0       mime_0.13          
#> [97] bit64_4.6.0-1       MASS_7.3-65

References

  1. Sugar, C. A., & James, G. M. (2003). Finding the number of clusters in a dataset: An information-theoretic approach. Journal of the American Statistical Association, 98(463), 750-763.

  2. James, G. M., & Sugar, C. A. (2003). Clustering for sparsely sampled functional data. Journal of the American Statistical Association, 98(462), 397-408.

  3. Ramsay, J. O., & Silverman, B. W. (2005). Functional data analysis. Springer.

  4. Ferraty, F., & Vieu, P. (2006). Nonparametric functional data analysis: theory and practice. Springer.


Appendix

Computational Details

  • Analysis performed on: 2025-09-18 16:39:04
  • R version: R version 4.4.1 (2024-06-14)
  • Cores used: 7
  • Total computation time: Varies with dataset size and parameters

Troubleshooting Common Issues

Data Format Problems

  • Ensure column names match requirements (subjID, measureID, time, value)
  • Check for missing values in key columns
  • Verify ID consistency between time series and annotations
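A quick pre-flight check of ID consistency can prevent a failed import; a sketch with hypothetical tables:

```r
# Hypothetical tables standing in for your time-series and annotation data
ts  <- data.frame(subjID = c("S1", "S2", "S3"))
ann <- data.frame(subjID = c("S1", "S2"))

# Curves that have no matching annotation row
missing_ann <- setdiff(unique(ts$subjID), ann$subjID)
missing_ann  # "S3" -- fix these before calling ConnectorData()
```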

Convergence Issues

  • Reduce spline dimension (p) if clustering fails
  • Increase number of runs for stability
  • Consider data truncation to remove sparse regions

Memory and Performance

  • Use fewer clusters or runs for large datasets
  • Utilize parallel processing with cores parameter
  • Consider data subsampling for initial exploration

This guide was generated using the MultiConnector package. For updates and additional resources, visit the package documentation.